Similarity Lite

Description

The step builds a model from the input training data and then uses that model to predict the answer to the input query, writing the result to the specified output field. All processing is carried out in a single execution.

Note: Use this step when the data volume is small relative to the hardware configuration of the machine running the workflow.

The step finds the sentences most similar to the input query within the given input, whether that input is a single sentence, a paragraph, or a longer text.
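
To make the behavior concrete, the sketch below shows the kind of similarity search the step performs, assuming a TF-IDF representation and cosine similarity. The libraries and names used here are illustrative only; the step's internal model may differ.

```python
# Minimal sketch: find the top-n sentences most similar to a query.
# Assumes TF-IDF + cosine similarity for illustration; not the step's actual API.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Weather today is good",
    "The forecast predicts rain tomorrow",
    "Today the weather is sunny and pleasant",
]
query = "How is the weather today?"
top_n = 2  # corresponds to the "Top n results" configuration

vectorizer = TfidfVectorizer()
sentence_vectors = vectorizer.fit_transform(sentences)  # "build" phase
query_vector = vectorizer.transform([query])            # "predict" phase

scores = cosine_similarity(query_vector, sentence_vectors).ravel()
for idx in scores.argsort()[::-1][:top_n]:
    print(f"{scores[idx]:.3f}  {sentences[idx]}")
```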

Configurations

General tab

| No. | Field Name | Description |
| --- | --- | --- |
| 1 | Step Name | Specify the name of the step. Step names must be unique within a workflow. |
| 2 | Number of Rows to Process | Specify the total number of rows to be taken as input. (Default value: 500) |
| 3 | Build using AE Model Version | Select from the dropdown which Python version to use for building the model and making predictions. |
| 4 | Query | Specify which column/features are to be considered for building the model. |
| 5 | Top n results | Specify the number of rows closest to the original answer to be fetched as output. |

Field Mapping tab

| No. | Field Name | Description |
| --- | --- | --- |
| 1 | Feature / Name | Feature or name used during the model building step. |
| 2 | Text Preprocessing | Preprocessing options used to process the text/string. Refer to the "Classification Model Builder" step documentation. |
| 3 | Target Field | Specify the output field name in which the prediction value will be written. |

When you process a feature of type string, as mentioned in the "Text Preprocessing" row of the table above, that feature needs to be converted into numeric features. The Text Vectorization tab governs how all string features are converted into numeric features. An n-gram is a contiguous sequence of n items from a given sample of text or speech. The table below shows how a string is tokenized internally for different n-gram values.

| No. | String | N Gram Start/End | Tokens |
| --- | --- | --- | --- |
| 1 | Weather today is good | 1-1 | 'Weather', 'today', 'good' |
| 2 | Weather today is good | 1-2 | 'Weather', 'today', 'good', 'Weather today', 'today good' |
| 3 | Weather today is good | 1-3 | 'Weather', 'today', 'good', 'Weather today', 'today good', 'Weather today good' |
| 4 | Weather today is good | 2-3 | 'Weather today', 'today good', 'Weather today good' |

* 'is' is treated as a stop word and is not considered.
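
The tokenization shown in the table above can be reproduced with scikit-learn's CountVectorizer, used here only as an illustrative stand-in for the step's internal tokenizer:

```python
# Sketch of n-gram tokenization for the example string, assuming
# scikit-learn's CountVectorizer; the step's internal tokenizer may differ.
from sklearn.feature_extraction.text import CountVectorizer

sentence = "Weather today is good"

for ngram_range in [(1, 1), (1, 2), (1, 3), (2, 3)]:
    vectorizer = CountVectorizer(
        ngram_range=ngram_range,  # N Gram start / N Gram end
        stop_words=["is"],        # 'is' is dropped as a stop word
        lowercase=False,          # keep the casing used in the table
    )
    analyzer = vectorizer.build_analyzer()
    print(ngram_range, analyzer(sentence))

# (1, 1) ['Weather', 'today', 'good']
# (1, 2) ['Weather', 'today', 'good', 'Weather today', 'today good']
# (1, 3) ['Weather', 'today', 'good', 'Weather today', 'today good', 'Weather today good']
# (2, 3) ['Weather today', 'today good', 'Weather today good']
```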

Text Vectorization tab

| No. | Field Name | Description |
| --- | --- | --- |
| 1 | N Gram start | Must be a numeric value with a minimum of 1. |
| 2 | N Gram end | Must be a numeric value greater than or equal to N Gram start. |
| 3 | Vectorization | The n-gram operation tokenizes the input string feature; vectorization then converts these tokens into the numeric features required by the algorithms. Three types of vectorizers are supported, as described below. |

- Count Vectorizer: counts the number of times a token appears in the document and uses this count as its weight.
- Tfidf Vectorizer: TF-IDF stands for "term frequency-inverse document frequency", meaning the weight assigned to each token depends not only on its frequency in a document but also on how common that term is across the entire corpus (more common terms receive lower weights).
- Hashing Vectorizer: designed to be as memory efficient as possible. Instead of storing tokens as strings, the vectorizer applies the hashing trick to encode them as numerical indexes. The downside of this method is that, once vectorized, the features' names can no longer be retrieved.
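
The sketch below contrasts the three vectorizer types using their scikit-learn counterparts; this is an illustrative assumption about the underlying implementation, not the step's actual API.

```python
# Sketch comparing the three supported vectorizer types, assuming their
# scikit-learn counterparts; the step may wrap them differently.
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfVectorizer,
    HashingVectorizer,
)

docs = ["Weather today is good", "Weather tomorrow may be bad"]

# Count Vectorizer: raw token counts as weights.
counts = CountVectorizer().fit_transform(docs)

# Tfidf Vectorizer: counts reweighted by inverse document frequency.
tfidf = TfidfVectorizer().fit_transform(docs)

# Hashing Vectorizer: tokens hashed to a fixed number of columns, so no
# vocabulary is stored and feature names cannot be recovered afterwards.
hashed = HashingVectorizer(n_features=2**10).fit_transform(docs)

print(counts.shape, tfidf.shape, hashed.shape)
```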